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Abstract 

Background: The caleosin genes encode proteins with a single conserved EF hand calcium-binding domain and 
comprise small gene families found in a wide range of plant species. These proteins may be involved in many 
cellular and biological processes coupled closely to the synthesis, degradation, or stability of oil bodies. Although 
previous studies of this protein family have been reported for Arabidopsis and other species, understanding of the 
evolution of the caleosin gene family in plants remains inadequate. 

Results: In this study, comparative genomic analysis was performed to investigate the phylogenetic relationships, 
evolutionary history, functional divergence, positive selection, and coevolution of caleosins. First, 84 caleosin genes 
were identified from five main lineages that included 15 species. Phylogenetic analysis placed these caleosins into 
five distinct subfamilies (sub l-V), including two subfamilies that have not been previously identified. Among these 
subfamilies, sub II coincided with the distinct P-caleosin isoform recently identified in the pollen oil bodies of lily; 
caleosin genes from the same lineage tended to be clustered together in the phylogenetic tree. A special motif 
was determined to be related with the classification of caleosins, which may have resulted from a deletion in sub I and 
sub III occurring after the evolutionary divergence of monocot and dicot species. Additionally, several segmentally and 
tandem-duplicated gene pairs were identified from seven species, and further analysis revealed that caleosins of different 
species did not share a common expansion model. The ages of each pair of duplications were calculated, and most were 
consistent with the time of genome-wide duplication events in each species. Functional divergence analysis showed that 
changes in functional constraints have occurred between subfamilies l/IV, ll/IV, and HA/, and some critical amino acid sites 
were identified during the functional divergence. Additional analyses revealed that caleosins were under positive selection 
during evolution, and seven candidate amino acid sites (70R, 74G, 88 L, 89G, 100 K, 106A, 107S) for positive selection were 
identified. Interestingly, the critical amino acid residues of functional divergence and positive selection were mainly 
located in C-terminal domain. Finally, three groups of coevolved amino acid sites were identified. Among these coevolved 
sites, seven from group 2 were located in the Ga 2+ -binding region of crucial importance. 

Conclusion: In this study, the evolutionary and expansion patterns of the caleosin gene family were predicted, and a 
series of amino acid sites relevant to their functional divergence, adaptive evolution, and coevolution were identified. 
These findings provide data to facilitate further functional analysis of caleosin gene families in the plant lineage. 



Background 

Lipids in plant seeds are stored in specialized organelles 
termed oil bodies (OBs) to serve as an energy source for 
germination [1]. Seed OBs are composed of a core of neu- 
tral lipids, mainly triacylglycerols (TAGs), surrounded by a 
phospholipid (PL) monolayer embedded with integral pro- 
teins [2]. Only a limited number of proteins are specifically 
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associated with OBs; these proteins belong to three major 
classes comprising structural proteins (oleosins and caleo- 
sins), enzymes (e.g., steroleosins and lipases), and minor 
proteins (e.g., aquaporins) [3]. To date, oleosins, caleosins, 
and steroleosins have been identified in OBs of various 
plant species, including Arabidopsis, Sorghum bicolor, and 
Zea mays [4,5]. The most abundant structural proteins in 
plants are oleosins, which are basic proteins unique to 
plants and found in various organs [1,6]. Oleosins have 
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been demonstrated to play roles in the stability, synthesis, 
and metabolism of OBs [7]. 

Caleosin genes are widely distributed in true fungi, uni- 
cellular microalgae, and higher plants [8]. The first caleosin 
sequence was cloned and characterized in rice seeds ger- 
minating in response to abscisic acid treatment by Fradsen 
et al. in 1996 [9]. Subsequently, Caleosin was first identified 
to comprised a calcium-binding motif, and thus tentatively 
named caleosin by Chen et al [4], who also proposed a 
structural model for the protein based on a hydropathy plot 
and analysis of its secondary structure. The N-terminal 
hydrophilic domain of caleosin consists of an EF-hand 
calcium-binding motif of 28 residues including an invari- 
able glycine residue as a structural turning point and five 
conserved oxygen-containing residues as calcium-binding 
ligands [4,10]. The central hydrophobic domain of caleosin 
contains an amphipathic a-helix and an anchoring region. 
This amphipathic a-helix is assumed to be located at the 
interface between hydrophobic and hydrophilic environ- 
ments while the anchoring region is predicted to contain a 
pair of antiparallel (3-strands connected with a proline-knot 
motif. The C-terminal hydrophilic domain of the protein 
contains several potential phosphorylation sites [10-12]. 

Caleosin comprises a calcium-binding motif and sev- 
eral potential phosphorylation sites, that is, well-known 
candidates involved in signal transduction, and thus 
may possess biological function(s) in addition to its 
structural role for the stability of oil bodies [13]. Ac- 
cording to the characterization of two independent in- 
sertion mutants lacking caleosin, it was proposed that 
caleosin might play a role in the degradation of storage 
lipids in oil bodies by inducing the interaction of oil bodies 
with vacuoles during germination [14]. Putative interaction 
between oil bodies and vacuoles has also been observed in 
pollen cells after germination under electron microscopy; 
and the pollen oil bodies appeared to be surrounded by 
tubular membrane structures and encapsulated in vacuoles 
after germination [15,16]. Caleosin isoforms or caleosin-like 
proteins are not only localized in oil bodies but have also 
been found as membrane-bound proteins in other subcellu- 
lar fractions, such as the microsomal membrane; moreover, 
they were demonstrated to possess different biological func- 
tions, such as peroxygenase activity in biotic and abiotic 
stress responses in their phosphorylated forms [17-19]. 

Considering the economic significance of oil seeds, 
caleosin, as one of the primary OB-associated proteins, 
has been examined extensively [20,21]. However, the evo- 
lution of caleosins has not been studied in all plant line- 
ages; a comprehensive comparative genome study would 
improve understanding of the evolution and function of 
the caleosin family. In the present work, we identified 
all caleosin gene families in 15 plant species, repre- 
senting the major plant lineages with available genome 
sequences. We then performed phylogenetic analysis, 



exon/intron structural analysis, and motif analysis to trace 
the evolutionary history of the caleosin family in plants, 
and detected segmental duplication and tandem duplication 
events to gain insight into possible mechanisms for the ex- 
pansion of caleosin gene families. Analysis of functional di- 
vergence using bioinformatics software suggested that 
changes in selective constraints and amino acid properties 
occurred after gene duplication, which led to subfamily- 
specific functional evolution after the diversification of 
caleosins. Positive selected sites and coevolved sites were 
also predicted, and finally, three types of critical amino acid 
sites were proposed. 

Results 

Identification of caleosin genes 

To explore the origin and evolutionary history of the 
caleosin gene family, we identified caleosin genes from 15 
species representing the five major plant lineages: the 
green algae Chlamydomonas reinhardtii and Volvox car- 
ters the moss Physcomitrella patens; the lycophyte Sela- 
ginella moellendorffu; the monocotyledonous angiosperms 
Brachypodium distachyon, Oryza sativa, Setaria italica, 
Zea mays, and Sorghum bicolor; and the dicotyledonous 
angiosperms Arabidopsis thaliana, Citrus sinensis, Glycine 
max, Populus trichocarpa, Brassica rapa, and Cucumis 
sativus. We retrieved the available caleosin or caleosin-like 
genes using the Phytozome, JGI, TAIR, RAP, and BRAD 
databases, and BLASTP search results identified 101 
caleosin homologue genes that encode proteins containing 
the caleosin domain. The search results were further ex- 
amined using SMART and PFAM (PF05042) to confirm 
the presence of the conserved caleosin domain. Finally, we 
identified 84 caleosin genes from the above 15 species 
representing five major lineages (Additional file 1 and 
Additional file 2). 

Phylogenetic relationships and evolution of the caleosin 
gene family 

To investigate the evolutionary relationships of caleosins 
among various plant species, the 84 identified protein se- 
quences were used to execute multiple sequence alignment 
using Clustal X [22]. Based on this alignment, a rooted 
neighbor-joining (N-J) phylogenetic tree was constructed 
from the full-length protein sequences using MEGA5 [23]. 
To further confirm the topology of phylogenetic tree, a 
maximum likelihood (ML) phylogenetic tree was con- 
structed, which showed similar topology to the N-J tree 
with only minor modifications (data not shown). According 
to the topology and the deep duplication nodes of caleosin 
paralogs in the N-J tree, the caleosin gene families could be 
divided into five well-conserved subfamilies (Figure 1A, 
Additional file 3). We numbered these subfamilies sub I, 
sub II, sub III, sub IV, and sub V in the phylogenetic tree. 
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Figure 1 Phylogenetic relationships, exon-intron structure, and motif structures of caleosin genes. (A) The neighbor-joining (N-J) 
phylogenetic tree was constructed based on a complete protein sequence alignment of 84 caleosin genes identified using Clustal X and MEGA5. 
Numbers at the nodes represent bootstrap support (1000 replicates). The color of subclades indicates the five corresponding gene subfamilies. 

(B) Exon-intron structures of the caleosin genes. Boxes: exons; lines: introns. The lengths of boxes and lines are scaled based on gene length. 

(C) MEME motif search results. Conserved motifs are indicated in numbered color boxes. 



Our analysis demonstrated that caleosin genes from the 
same lineage tended to be clustered together in the phylo- 
genetic tree. Of the five subfamilies, sub V was only 
present in lower land plants (i.e., green algae and mosses), 
while subs I-IV were only present in higher land plants 



(i.e., angiosperms) with three exceptions (CrCLOl and 
CrCL02 were located in sub I, and SmCLOl was in sub 
IV). Moreover, subs I and II were found exclusively in 
monocots (Figure 1A), while sub III was only found in di- 
cots. These observations indicated that the caleosin genes 
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shared a common ancestor before the divergence between 
lower and higher land plants. We propose an evolutionary 
pattern for caleosins based on the phylogenetic tree: first, 
all caleosins shared a common ancestor and the lower 
plants (sub V) were the first to diverge. Sub IV included 
both monocots and dicots, which indicated that this sub- 
family diverged before the monocot-dicot split approxi- 
mately 200 MYA. The subsequent divergence between 
subs I, II, and III was thought to have occurred after the 
monocot-dicot split because all members of subs I and 
sub II were monocots while members of sub IV were ex- 
clusively dicots. 

We conducted an analysis of the exon-intron struc- 
ture of individual caleosin genes in all plant lineages 
using the online Gene Structure Display Server [24] to 
gain insight into possible mechanisms for the expansion 
of the multiple gene families. A detailed illustration of 
the relative length of introns and of conservation of the 
corresponding exon sequence within each caleosin 
paralog is provided in Figure 2B. Genes in the same sub- 
family had similar exon-intron structures. The number 
of introns did not differ significantly among the caleosin 
genes and ranged from four to six with the following ex- 
ceptions: CsCLOl from sub III contained three introns; 
PpCLOl from sub V had two introns, and CrCL03 from 
sub V contained 11 introns. 

A MEME (Multiple Em for Motif Elicitation) search 
[25] for conserved protein motifs flanking the caleosin 
genes was conducted (Figure 1C) to determine possible 
mechanisms for the structural evolution of caleosin genes. 
Six motifs were detected and proteins in the same subfam- 
ily shared similar numbers and patterns of conserved mo- 
tifs, The sequence logos for each motif, containing stacks 
of letters at each position, are shown in Additional file 4. 
Interestingly, the distribution of Motif 6 revealed obvious 
regularities that corresponded closely to the phylogenetic 
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Figure 2 Venn diagram of Type-I and Type-ll amino acid sites 
related to functional divergence, and positively selected sites. 

All sites are positioned on the reference sequence {AtCL05) based 
on sequence alignment. 



relationships: Motif 6 was absent from subs I and III, while 
all members of subs II, IV, and V, except for GrnCL07, con- 
tained this motif. Moreover, we observed that caleosin pro- 
tein sequences from the same species were divided into 
different subfamilies: the monocot members were separated 
in sub III and sub IV, while the dicot members were sepa- 
rated in subs I, II, and IV. Multiple alignment of each spe- 
cies was performed to explore the possible causes of this 
arrangement, and the results are shown in in Additional 
files 5 and 6 for dicots and monocots, respectively. In the 
figures, proteins from subs I-IV are colored red, green, yel- 
low, and blue, respectively. As may be seen, the presence or 
absence of motif 6 is strongly associated with the separation 
of members into different subfamilies. The distribution of 
Motif 6 was also consistent with the evolutionary pattern of 
caleosin that we proposed from the phylogenetic analysis. 
We infer from these combined data that the ancestor of 
caleosins may have contained Motif 6 and that all members 
of sub V still contained this motif after the first divergence, 
as Motif 6 was conserved in both lower plants and angio- 
sperms. Subsequently, after the second divergence before 
the monocot-dicot split, Motif 6 was conserved in all sub- 
IV members except for GmCL07. However, during the di- 
vergence between subs I-III, only the members of sub II 
retained Motif 6. 

Expansion analysis of the caleosin gene family 

As is known, duplication at the gene and genome levels is 
a widespread process that contributes to the evolution of 
biological novelty [26]. Here, we analyzed segmental and 
tandem duplications of caleosin genes to investigate the 
main expansion pathways of caleosins and the differences 
in these genes between monocots and dicots. We identi- 
fied 12 pairs of segmentally duplicated genes (Table 1) and 
nine pairs of tandem duplicated genes (Table 2). To esti- 
mate the approximate ages of the duplication events, syn- 
onymous base substitution rates (Ks values) and four-fold 
degenerate transversion site (4DTv) distances were esti- 
mated as a proxy for time. 

Our results showed that caleosin genes from different 
species did not share a common expansion model. Only 
tandem duplications were identified in the monocots 
Brachypodium and Sorghum; only segmental duplica- 
tions were identified in dicots Populus and Brassica; and 
both types of duplications were identified in rice, Arabi- 
dopsis, and soybean (Additional file 7). Moreover, no seg- 
mental duplication was identified in sub II and no tandem 
duplication identified in sub IV, while subs I and III in- 
cluded both types of duplications, suggesting that different 
subfamilies had different expansion pathways. 

Synonymous base substitution rates (Ks values) were esti- 
mated to calculate the approximate ages of segmental dupli- 
cation events (Table 1). The one pair occurring in monocots, 
OsCL02/OsCL06 in rice, was predicted to have diverged 
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Table 1 Estimates of the dates for the segmental duplication events of caleosin gene family 



Gene pairs KS (mean ± s.d.) Estimated time (mya) GWD (mya) References 



OsClo2 


OsClo6 


0.455 ±0.1 05 


35.0 


30-40 


Vandepoele et al., 2003 [27] 


GmGol 


GmClo6 


0.1 69 ±0.072 


13.9 


13, 59 


Schmutz et al., 2010 [28] 


GmClo5 


GmClo7 


0.1 63 ±0.055 


13.4 






AtClo8 


AtClo4 


0.71 2 ±0.1 82 


23.7 


28-48 


Ermolaeva et al., 2003 [32] 


AtClol 


AtClo2 


0.71 2 ±0.1 12 


23.7 






PtClol 


PtClo3 


0.348 ±0.144 


19.1 


8-13 


Tuskan et al., 2006 [34] 


BrClol 


BrClol 0 


0.810 ±0.048 


28.9 


13-17 


Town et al., 2006 [36] 


BrClo2 


BrClo4 


0.363 ± 0.798 


12.9 






BrClo3 


BrClo9 


0.453 ± 0.204 


16.2 






BrClo4 


BrClo5 


0.753 ± 0.059 


26.9 






BrClo4 


BrClo6 


0.400 ±0.1 60 


14.3 






BrClo8 


BrClo9 


0.353 ± 0.095 


12.6 







about 35 MYA during the genome-wide duplication (GWD) 
event [27], indicating that this gene pair was conserved dur- 
ing evolution after the GWD of rice. Previous studies have 
demonstrated that the soybean genome was complicated by 
at least two rounds of GWD at about 13 and 59 MYA 
[28,29]; thus, it is likely that the pairs GmCL01/GmCL06 
and GmCL05/GmCL07 were conserved after the second 
ancient large-scale duplication about 13 MYA. Analysis of 
Arabidopsis caleosin segmentally duplicated gene pairs 
(AtCL08/AtCL04 and AtCL01/AtCL02) suggested that the 
timing of their duplication (approximately 23.7 MYA) was 
consistent with the emergence of crucifers between 24-40 
MYA [30,31] but after its GWD, which occurred approxi- 
mately 28-48 MYA [32]. The GWD of poplar is thought to 
have occurred about 8-13 MYA [33] but segmental duplica- 
tions of poplar caleosin genes were estimated to have oc- 
curred earlier, about 19.1 MYA. Despite similar magnitudes 
of GWD, Arabidopsis contained more caleosin genes than 
poplar (eight and three, respectively), probably because the 
poplar genome subsequently experienced a high level of 
gene loss [34]. Among the dicots, Brassica rapa showed the 
largest number of segmental duplications (six), four of which 



Table 2 4DTv distance between paralogous genes 

Species Gene pairs D 4DTv value 



Brachypodium distachyon 


BdClo4 


BdClo5 


0.399 


Brachypodium distachyon 


BdClo6 


BdClo7 


0.414 


Sorghum bicolor 


SbClol 


SbClo2 


0.206 


Sorghum bicolor 


SbClo7 


SbClo8 


0.072 


Oryza sativa 


OsClol 


OsClo2 


0.229 


Oryza sotivo 


OsClo6 


OsClo7 


0.039 


Arabidopsis tholino 


AtClo8 


AtClo7 


0.235 


Arabidopsis thalina 


AtClo4 


AtClo5 


0.138 


Glycine max 


GmClol 


GmClo3 


0.189 



corresponded to a whole-genome triplication event that is 
thought to have occurred between 13 and 17 MYA [35,36]. 
We propose that the identified segmentally duplicated gene 
pairs were mostly conserved after the large-scale duplication 
event of each species during its evolution. In addition, the 
two genes of each duplicated pair belonged to the same sub- 
families, which suggested that they had not undergone evo- 
lutionary divergence after duplication. 

Genetic distance-transversion rates at 4DTV of 
tandem-duplicated caleosin gene pairs were calculated 
using the PAML software package [37]; 4DTV distance 
ranged from zero for recently duplicated peptides to -0.5 
for paralogs with an ancient evolutionary past. Among the 
nine tandem- duplicated pairs, only two (SbCL07/SbCL08, 
OsCL06l OsCLOT) appeared to have occurred in the re- 
cent past (4DTV values <0.1), while the others were con- 
served from more ancient times (Table 2). 

Functional divergence after gene duplication 

Two types of functional divergence (Type-I and Type- 
II) between gene clusters of the caleosin subfamily 
were estimated using DIVERGE2 [38], which evaluates 
shifts in evolutionary rate and altered amino acid proper- 
ties after gene duplication. Type-I functional divergence 
refers to evolutionary processes resulting in site-specific 
rate shifts after gene duplication, while Type-II results in 
site-specific property shifts. Each of the 81 caleosin pro- 
teins of the five subfamilies were used as input files in DI- 
VERGE, and the estimates were based on the multiple 
amino acid sequence alignments of caleosin proteins from 
any two subfamilies. The coefficients of Type-I functional 
divergence between subfamily pairs I/IV, II/IV, and II/V 
were statistically significant (6 > 0; likelihood ratio test 
statistic > 4.96; p < 0.01), indicating that significantly differ- 
ent site-specific shifts in evolutionary rate may have taken 
place at certain amino acid sites between these pairs. On 
the other hand, there was evidence of Type-II functional 
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divergence between four subfamily pairs (I/IV, II/III, II/IV, 
and II/ V), indicating a radical shift in amino acid proper- 
ties (Table 3). 

To identify critical amino acid sites that may be re- 
sponsible for functional divergence between caleosin 
subfamilies, the posterior probability (Qk) of divergence 
was determined for each site. Large values of Qk indicate 
a high probability that the evolutionary rate or physio- 
chemical amino acid properties of a site differ between 
two clusters. DIVERGE2 thus identified critical amino 
acid sites that are highly relevant to functional divergence 
(Table 3). To greatly reduce false positives, values of Qk > 
0.8, p < 0.01 were empirically used as a cutoff for identify- 
ing Type-I and Type-II functional-divergence-related resi- 
dues between gene subfamilies. Critical amino acid sites 
were identified in five groups of caleosin subfamilies for 
the analysis of Type-I functional divergence. Only two 
sites were identified between subs I/II, and one site be- 
tween subs I/V. In contrast, the numbers of amino acid 
sites identified between subs I/IV, II/IV, and II/V were 36, 
38, and 38, respectively. Fewer sites were identified as re- 
sponsible for Type-II than for Type-I functional diver- 
gence, with 7, 8, 8, and 6 sites between subs I/IV, II/III, II/ 
IV, and II/V, respectively (Table 3, Additional file 8). Inter- 
estingly, all of the Type-II sites for each subfamily group 
belonged to the corresponding Type-I sites. In other 
words, all of the detected Type-II sites of functional diver- 
gence that underwent site-specific property shifts also 
underwent site-specific rate shifts (i.e., Type-I functional 
divergence) (Figure 2, Additional file 8). 

Additional analyses were performed to investigate the 
critical amino acid sites responsible for functional diver- 
gence. Their relationships to four motifs supposed to be 
functionally important (Ca 2+ binding motif, Proline-knot, 
Tyr kinase-Pi, and CKII-Pi) [4] were identified: three sites 



coincided with the proline-knot motif, eight sites coincided 
with Tyr kinase-Pi, and two sites were located in the CKII- 
Pi motif (Figure 3, Additional file 8). The majority of the 
amino acid sites (29 of 38) were distributed in the C- 
terminal domain of caleosin. While the functions of these 
sites need to be experimentally verified. Thus, the results of 
the functional divergence analysis suggest that, because of 
the different evolutionary rates predicted at some amino 
acid sites, the caleosin genes may be significantly divergent 
from each other in their functions. Mutations in amino 
acids may have caused the caleosin gene family to evolve 
new functions after divergence, and hence, functional diver- 
gence might reflect the existence of long-term selective 
pressures. 

Variable selective pressures among amino acid sites 

To analyze positive selection of specific amino acid regions 
within full-length caleosin protein sequences, substitution 
ratios for nonsynonymous and synonymous mutation rates 
(dN/dS = (o, also known as KalKs) were calculated. The 
"codeml" program of the PAML4 software package [37] 
was utilized to test the hypothesis of positive selection in 
the caleosins. The estimation of positive selection was 
based on the tree topology shown in Figure 2, and two pairs 
of models (M0/M3 and M7/M8) were selected and com- 
pared with the site-specific codeml model to test whether 
variable co ratios occurred at amino acid sites. The param- 
eter estimates and log-likelihood values for each model are 
provided in Table 4 [39]. 

In the first model pair (M0/M3), the discrete model (M3) 
was significantly better than the one-ratio model (MO), with 
likelihood-rate test statistic 2A£ = 127.29, which greatly 
exceeded the critical value of 13.28 with df = 4 (p < 0.01). 
Thus, M0 was rejected, indicating extreme variation in se- 
lection pressure among amino acid sites. This analysis 



Table 3 Functional divergence between groups of the caleosin subfamily 



Groupl Group2 Type-I Type-II 





e, ± sec 


LRT 


Q k >0.8 


Q k >0.9 


On ± sec 


Q k >0.8 


1 II 


0.41 8 ±0.1 88 


4.96* 


2 


0 


-0.078 ± 0.449 


0 


1 III 


0.532 ± 0.477 


1.25 


0 


0 


-0.278 ± 0.436 


0 


1 IV 


0.914± 0.156 


34.4** 


36 


24 


0.054 ±0.448 


7 


1 V 


0.41 8 ±0.21 9 


3.64 


1 


1 


-0.343 ± 0.459 


0 


II III 


0.386 ± 0.332 


1.36 


0 


0 


0.1 12 ±0.241 


8 


II IV 


0.993 ±0.1 73 


32.8** 


38 


38 


0.088 ± 0.342 


8 


II V 


0.967 ±0.3 16 


9.39** 


38 


38 


0.027 ± 0.283 


6 


III IV 


0.001 ±0.022 


0.00 


0 


0 


-0.1 92 ±0.324 


0 


III V 


0.253 ± 0.287 


0.78 


0 


0 


-0.212 ±0.245 


0 


IV V 


0.242 ± 0.205 


1.40 


0 


0 


-0.1 63 ±0.339 


0 



Note: 0| and 0 M; the coefficients of Type-I and Type-II functional divergence between two gene clusters; LRT, Likelihood Ratio Statistic, for p < 0.05 was marked by 
*p < 0.01 was marked by **Q k/ posterior probability. Large Q k value indicates a high possibility that the functional constraint (or the evolutionary rate) or the 
physicochemical properties of a given amino acid site is different between two clusters. 
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AtCLOl 
AtCL02 
AtCL03 
AtCL04 
AtCL05 
AtCL06 
AtCL07 
AtCL08 



AtCLOl 
AtCLOl 
AtCL03 
AtCL04 
AtCL05 
AtCL06 
AtCLOl 
AtCLOH 



AtCLOl 
AtCLOl 
AtCL03 
AtCL04 
AtCL05 
AtCL06 
AtCL07 
AtCLOH 



1 MGSKTEMMER DAMATVAPYA PVTYHRRARV 

1 MTS-MERMER DAMETVAPYA RVTYHRRVRG 

1 MAG EA EALATTAPLA PVTSQRKVRN 

1 MA SSISTG-VKF 

1 MA SSISAAEVKV 

1 MST--DHARA D S I ATVAEKA PVTAERRVRT 

1 MW NALKRLPLFP PSPKEDK-QL 

1 MS HQTVALASKA KSPKPKRGKL 



61 
60 
5 6 
15 
16 
59 
25 
26 



| Motif 6 | 

DLDDRLPKPY MPRALQAPDR EHPYGTPGHK 
DLDDTLPKPY LPRALQAPDM EHPQGTPDHR 
DLEETLPKPY MARALAAPDT EHPNGTEGHD 

VPEE 

VPEE 

DLDDRLPKPY VPRAMVAPDM ENVNGTRGHK 

RKER 

DKEK 



NYGLSVLQQH 
HNGLSVLQCJH 
SKGMSVMQQH 

DNFLQPH 

YNFLQKH 

HRDMSVLQQH 

LHWRNWC 

MTALEKjH 

*★ ★ 

proline-knot 



Ca 2+ binding motif 

VSFFDIDDNG IIYPWETYSG 
VAFFDLDNNG IIYPFETFSG 
VAFFDQNDDG IVYPWETYKG 
VAFFDRNKDG IVYPSETFQG 
VAFFDRNKDG IVYPSETFQG 
IAFFDQDGDG IIYPSETFRG 
PSLTETVTAL FIHEKPT-KG 
VSFFDRNKDG TVYPWETYQG 
★ ★ ★ ★ 

Tyr Kinase-Pi 



central helix 



LRMLGFNIIG SLIIAAVINL 
FRLLGFNLLA SLILAAGINI 
FRDLGFNPIS SIFWTLLINL 
FRAIGCGYLL SAVASVFINI 
FRAIGCGYLL STFAAVFINI 
FRALGFNLVS SIFLTIIVHL 
FRALGTGRFM SAFVAVFFNM 
FRALGTGRLL AAFVAIFINM 



121 LPSPFFlPllYI 
120 LPSPFFPIYI 
116 VPSPLLP^YI 

73 GFSIWFPIEV 

74 GFSFSFPIEV 



119 MPSPTFPIYI 



82 
84 



LFGYILPLFL 
GFSPLF|pJlDV 
A AAAA 



HNIHKSKJHJ S 
HNIHKAKHGS 
DNIHKAKHGS 
KNIHLAKHGS 
KNVRLGIHSS 
KNIHRAKH3S 
K-PFVCTVVT 
KNSHLCMIPS 
AA AAAAAAA 



DSKTYDNEGR 
DSKTYDNEGR 
DSSTYDTEGR 
DSGVYDKDGR 
DSGVYDKDGR 
DTSTYDTEGR 
-TDVYDKDGR 
DTDVYDDDGR 
A AAAAAAAA 
• 



FMPVNLELIF 
YTPANLELMF 
YVPVNLENIF 
FVASKFEEIF 
FVASKFEEIF 
YIPANLENMF 
FVESKFEEIF 
FVESKFEEIF 
AAAAAAAAAA 



CK II- 

SKYAKTLPDK 
SKYARTIPDK 
SKYALTVKDK 
TKHAHTHRDA 
AKHAHTHRDA 
SKYARTVPDK 
NKHARTHKDA 
NKHARTHKDA 
A A 
CK Il-Pi 



TLSYATIPGW 
ALSYATIPGW 
AFSYVTIPSW 
GLSSKTRPGK 
SLSSKTRPGK 
TMSYATI PTW 
GLSQKTRPVQ 
GLSKKTRgGK 
AA 

Pi 



LSLGELWEMT 
LSLGELWDMT 
LSFKEVWNVT 
LTNEELKQLL 
LTSKELKELL 

LTL 

LTAKEIKQML 
LTAEEIQKML 
A 



AtCLOl 
AtCLOl 
AtCL03 
AtCL04 
AtCL05 
AtCL06 
AtCL07 
AtCL08 



181 EGNRDAWDIF 

180 EGNRDAFDFF 

17 6 EGNRMAIDPF 

133 KANKE PNDRK 

134 KANREPNDCK 

171 

14 0 KTNREPYDFI 

14 4 KTNRDPFDIT 



GWIAGKIEWG 
GWLASKVEWG 
GWLSNKVEWI 
GWLAGYTEWK 
GGILAFGEWK 
— AASKMEWG 
GWLSDFIEWK 
GWLSDYGEWK 



LLYLLARDEE 
VLYALASDEE 
LLYILAKDED 
VLHYLCKDKN 
VLYNLCKDKS 
VLYFLAKDEN 
ILHTLAQD-N 
ILHTLAQDKN 



GFLSKEAIRR 
GFLSKEAIRR 
GFLSKEAVRG 
GLLHKDTVRA 
GLLHKEIVRA 
GHLSKEAVRR 
GLLTEDAVRG 
GLLSEKSVRA 



aFDGSLFEYp 
CETDGSLFEYS 
CFDGSLFEQI 
AXDGSLFEKL 
ViTDGSLFEQL 
CFDGSLFDYC 
VYDGSLFQQL 
IpfDGSLFHQ[L 
★ 



AKIYAGISED 
AKNYAEIKEY 
AKERANSRKQ 
EKQRSSKTSK 
EKQRSSKTP- 
AKARASTKKT 
EKKRSSSSSR 
EKKRSSSSSR 



AtCLOl 241 KTAYY-- 

AtCLOl 24 0 KTYY 

AtCL03 236 D 

AtCL04 193 KHP 

AtCLOS 192 

AtCL06 220 E 

AtCLOl 199 GKKQKLP 
AtCLOS 2 04 GKKQKLP 

Figure 3 Multiple sequence alignment of Arabidopsis caleosin protein sequences. The positions of the motif 6, a calcium-binding motif, a 
proline knot-like motif, and four phosphorylation sites (one tyrosine kinase and three casein kinase II phosphorylation sites) are indicated on the 
tops of the sequences. The two invariable proline residues in the proline knot-like motif are labeled by black boxes, two His sites and two Cys 
sites of dissulfide bridge are highlighted by red and blue frames, respectively. The consensus sequence of the phosphorylation sites are marked 
by upper lines with the potential phosphorylated residue pointed by arrows. The critical amino acid sites of functional divergence, adaptive 
selection and coevolution are marked by red triangles, blue circles, and purple stars, respectively. 



suggested that the caleosin sequences were under positive 
evolutionary selection. We performed additional tests using 
the M7 (beta) and M8 (beta and co) models to detect amino 
acid sites that were under adaptive evolution. The likeli- 
hood rate test statistic 2A£ = 1922.25 greatly exceeded the 
critical value, and seven candidate amino acid sites for posi- 
tive selection were identified (70R**, 74G**, 88 L**, 89G, 
100 K** 106A, 107S**) based on BEB analysis of M8 (••, 
significant at p < 0.01). 

We observed relationships between amino acid sites under 
positive selection and functional divergence (Figure 2). Sites 



70R, 88 L, and 106A were under positive selection as well as 
Type-I and Type-II functional divergence; Sites 89G, 100 K, 
and 107S were under both positive selection and Type-1 
functional divergence. Additionally, the seven sites were also 
compared to the important motifs; two sites was identified 
in the proline-knot motif, the other 5 sites were all located in 
the C-terminal domain (Figure 3). 

Coevolution of caleosin amino acid sites 

Testing coevolution between sites is an essential comple- 
ment to molecular-selection analysis, since methods 
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Table 4 Log-likehood values and parameter estimates of caleosins under site-specific models 



Model 


InL 


2A£ 


Estimates of parameters 


Positively selected sites 


MO (one-ratio) 


-4930.58 




00 = 0.12741 


None 


M3 (discrete) 


-4866.93 


127.29** (M0vsM3) 


Po = 0.05223, p! = 0.37093 p 2 = 0.57684, 
oo 0 = 0.00000 00t =0.06189, oo 2 = 0.1 9751 


None 


M7 (beta) 


-4855.75 




p= 1.18755, q = 7.1 1704 


Not allowed 


M8 (beta&co) 


-5816.87 


1922.25** (M7vsM8) 


Po = 0.99999, p = 0.95936 q = 1 .53716, 
Pt =0.00001 00 = 2.94912 


70R** 74G** 88 L** 89G 100K** 106A 107S** 



Note: p < 0.05 was marked by *p < 0.01 were marked by **; 

w: The substitution rate ratios of non-synonymous (dN or Ka) versus synonymous (dS or Ks) mutations. 
Codon (amino acid) positions presented above are based on the AtCL05 gene. 



designed to detect adaptive evolution based on Bayesian 
approaches do not account for evolutionary interdepend- 
ence between protein residues, which would providing 
more biologically realistic results. We used coevolution 
analysis using protein sequences (CAPS), which is signifi- 
cantly more sensitive than other methods, to analyze co- 
evolution in caleosins [40]. Identification of coevolved 
residues can provide insight into pairs of amino acid sites 
within a protein whose evolution is linked by structural, 
functional, or interacting constraints. 

Three groups of amino acid sites (groups 1-3) were 
predicted by their hydrophobicity and molecular weight 
to spatially contact each other and to have revolution- 
ary significance. The largest was group 2, which con- 
sisted of five sites (23H, 26 F, 28D, 30 N, and 32D); 
groups 1 and 3 contained only two sites: 17Y and 18 N 
(group 1) and 161D and 176D (group 3). Comparison of 
the sites to the important motifs revealed that the five 
group-2 sites coincided with the Ca 2+ -binding motif, 
which was highly important to caleosin (Figure 3). 

Discussion 

Evolution of the caleosin gene family 

In this study, we identified 84 caleosin genes from 15 
species representing 5 main plant lineages by compara- 
tive genomic analysis. The caleosin genome sequence 
provides a large amount of data that can be used to ex- 
plore functional diversity from multiple perspectives. A 
N-J phylogenetic tree including 84 distinct protein se- 
quences clearly demonstrated that these genes could be 
divided into five subfamilies (Figure 1A). A ML phylo- 
genetic tree was also constructed, which showed similar 
topology to the N-J tree with only minor modifications. 
This classification was further supported by the results 
of motif and exon/intron analyses. Hanano et al. pro- 
posed a classification based on evolutionary relationship 
in which the caleosin genes were divided into three dis- 
tinct classes: class I, class II, and class III, which coin- 
cided with sub IV, sub III, and sub V, respectively [17]. 
Moreover, in our research we identified anther two new 
subfamilies in the present study: sub I and sub II, the 
members of which members are monocots (with two 



exceptions) and are predicted to have divided from the 
other subfamilies after the monocot-dicot split according 
to the proposed evolutionary pattern of caleosin. Previ- 
ous research has detected a caleosin in lily pollen oil 
bodies that appears to be a unique isoform distinct from 
that in lily seed oil bodies [15]. Sequence alignment and 
phylogenetic tree analysis have indicated that this lily 
pollen caleosin and a putative rice pollen caleosin may 
represent a class distinct from the caleosin found in seed 
oil bodies. By searching the accession number of the puta- 
tive rice pollen caleosin used in this previous study, we de- 
termined that it exactly matched the OsCL03 employed in 
our research, which was classified into sub II. Using 
BLASTP we determined that proteins in other species that 
were most similar to the lily pollen caleosin were all classi- 
fied into sub II (BdCL03, SiCLOZ ZrnCLOl, and SbCL02). 
In light of these results, we hypothesized that the distinct 
class in pollen discovered by previous research may belong 
to the same subfamily as the sub II detected in this study. 
Further work is required to determine if the other members 
of sub II are also expressed in pollen. 

Segmental duplication, tandem duplication, and trans- 
position events such as retro- and replicative transposition 
are the three main forces that drive the expansion of gene 
families [41,42]. Transposition events are difficult to iden- 
tify based on sequence analysis; therefore, we focused on 
segmental and tandem duplication events (Tables 1 and 2). 
In this paper, the clustering distribution of caleosin genes 
(Additional file 7) revealed that tandem duplication played 
an additional role in determining the current size of the 
caleosin gene family in monocots, while segmental dupli- 
cation is likely to have played a pivotal role in caleosin 
gene expansion in dicots. Moreover, age estimation of the 
OsCLO and GmCLO genes indicated that divergence 
within caleosin gene pairs occurred during the period of 
large-scale duplication events (Table 1). These observa- 
tions suggest that large-scale duplication may also have 
been involved in the expansion of the caleosin gene family 
in rice and soybean. 

The above analysis reveals that the caleosin gene family 
originated from a common ancestor, followed by lineage- 
specific expansion and divergence in each lineage and 
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species during its evolution. Moreover, lineage-specific ex- 
pansion, primarily by tandem duplication in monocots 
and by segmental duplication in dicots, is likely to have 
contributed to the size of the caleosin gene family. Large- 
scale duplication may also have been involved in the ex- 
pansion of the caleosin gene family in rice and soybean. 

The identification of Motif 6 

We identified 6 motifs in the present work, and the dis- 
tribution of Motif 6 was strongly related with the classi- 
fication of caleosin genes into subfamilies. Motif 6 was 
found in all members of subs II, IV, and V, with only one 
exception (probably due to the software parameter set- 
tings). Conversely, none of the sub I or sub III members 
contained Motif 6. Furthermore, the multiple alignments 
of sequences from the same species groups shown in 
Figure 3, and Additional file 5 and Additional file 6 indi- 
cated that the presence of Motif 6 played an important 
role in the classification of caleosins. 

Previous researches have reported that rice embryo 
caleosin possesses an additional N-terminal appendix of 
approximately 40-70 residues [11,43,44]. Chen et al. 
found that the molecular mass of caleosin in the oil bodies 
of rice embryos is 32 kDa with an N-terminal appendix of 
42 residues, as compared with the 27 kDa sesame caleosin. 
The sequence alignment of rice, adlay, maize, barley, ses- 
ame, and soybean in that study indicates that caleosins 
from monocot species possess an N-terminal appendix of 
40-70 residues absent in those from dicot species. By 
searching the accession numbers of the sequences aligned 
in the previous research, we identified that the sequences 
for rice, maize, and soybean were OsCL04, ZrnCL08, and 
GmCLOS, respectively. The sites, locations, and lengths of 
the N-terminal appendix identified for rice and maize in 
the previous work were similar to those of motif 6 in our 
research. However, we detected the presence of motif 6 in 
GmCLOS (which was not previously found); this result 
may have been due to the longer sequence of GmCLOS 
employed in the present study (50 residues more than the 
previous research) and the use of 84 alignment sequences 
(as opposed to only six in the previous work). We there- 
fore hypothesize that motif 6 coincides with the previously 
discovered N-terminal appendix. On the basis of the align- 
ment of six sequences, Chen et al. (2012) speculated that 
caleosins in monocot seed oil bodies possess an additional 
N-terminal appendix and are thus larger than those in 
dicot seed oil bodies. In our research, we observed that 
the monocot caleosins in sub I did not contain motif 6, 
while those in sub II and sub IV all possessed the motif. 
Likewise, for dicot caleosins, members of sub III did not 
contain motif 6, while those in sub IV all contained the 
motif (with one exception). Concerning the evolutionary 
pattern of caleosin based on the phylogenetic analysis, we 
suggest that the absence of motif 6 in sub I and sub III 



may have been the consequence of an evolutionary dele- 
tion occurring after the divergence of monocots and dicots. 
It remains unclear how the highly conserved caleosin iso- 
forms execute several diverse functions that may require a 
well-structured active site for enzymatic reaction and a 
well-featured binding surface for specific protein-protein 
interaction. We speculate that motif 6 may be related to 
the diverse functions of caleosins, but this hypothesis re- 
quires further investigation. 

Functional divergence and positive selection in the 
caleosin gene family 

Type-I and Type-II functional divergence between gene 
clusters of caleosin subfamilies were estimated by DI- 
VERGE analysis [38,45]. Type-I functional divergence re- 
sults in site-specific rate shifts, a typical example of 
which is an amino acid residue that is highly conserved 
in a subset of homologous genes and highly variable in a 
different subset of those genes [38,46] . Type-II functional 
divergence results in changes in cluster-specific amino acid 
properties [47,48]. Such divergence is exemplified by radical 
shifts such as positive versus negative charge differences at 
a homologous site that is otherwise evolutionally conserved 
between subfamilies within a phylogeny [49]. Significant 
differences in Type-I functional divergence between sub- 
family pairs indicated that different site-specific shifts in 
evolutionary rate may have occurred at certain amino acid 
sites, while observed Type-II divergence indicated a radical 
shift in amino acid properties. The analysis showed that 
functional divergence are mainly between subs I/IV, II/III, 
II/IV, and II/V. Among them, sub II seems to have been di- 
verged in function with both subs III, IV, and V. The results 
are also consistent with the conjecture discussed in the 
above that members of sub II might belong to the same 
subfamily with the pollen caleosins, thus caleosins in pollen 
might have diverged in function with that in seed oil bodies. 
Moreover, the critical amino acid sites of Type-I and Type- 
II functional divergence may be related to the specific func- 
tion of caleosin class in pollen which need further study. 

Positive selection has been reported to be associated with 
gene duplication and functional divergence. To explore 
whether positive selection drove the evolution of the caleo- 
sin gene family and whether selective constraints affected 
caleosin genes after duplication, it must be determined 
whether the rate of relaxation accelerated and if any amino 
acid residues were under positive selection. The results of 
positive selection analysis provided evidence for adaptive 
evolution, and seven sites (70R, 74G, 88 L, 89G, 100 K, 
106A, 107S) were predicted to have undergone positive se- 
lection. Interestingly, six of the seven sites (all but 74G) 
were also identified to have been under Type-I functional 
divergence, while 70R, 88 L, 106A were under positive se- 
lection and under Type-I and Type-II functional divergence 
(Figure 2). Gene duplication may result in altered functional 
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constraints between the gene clusters of a gene family. The 
results of the functional divergence analysis suggested that 
caleosin genes should be significantly functionally divergent 
from each other, especially with respect to the three amino 
acid residues (70R, 88 L and 106A) identified by both 
PAML 4 and DIVERGE 2.0 analyses, which were inferred 
to have played important roles during evolution. 

Coevolution in the caleosin gene family 

Protein evolution depends on intramolecular revolution- 
ary networks, the complexity of which is proportional to 
the underlying functional and structural interactions among 
sites [40]. The more complex the revolutionary network is 
for a particular site, the greater the selection coefficient 
may be against a mutation at that site due to the dramatic 
effect that this mutation would have on other coevolving 
protein regions. Testing for coevolution between sites is 
thus an essential step to complement molecular selection 
analysis and to provide more biologically realistic results. 

Coevolution analysis of caleosin proteins detected three 
groups of amino acid sites. Interestingly, further analysis 
determined that the five sites of group 2 were distributed 
in the Ca 2+ -binding motif (Figure 3). The calcium-binding 
capacity of caleosins has been demonstrated by Takahashi 
et al. [50] for AtCL03 (encoded by AT2G33380) isolated 
from A. thaliana and by Chen et al. [4] for caleosin from 
Sesamum indicum. Calcium ions are capable of promoting 
aggregation of purified OBs and are predicted to play an 
important role in OB aggregation and regulation during 
biogenesis processes [8]. This observation suggests that 
during the evolution of caleosin, sites in the Ca 2+ -binding 
motif appear to have coevolved. Moreover, the two group- 
3 sites, 161D and 176D, were separated in both primary 
and tertiary structure. Gloor et al. [51] showed that coevo- 
lution between distant sites occurred in sites proximal to 
regions with critical functions, where coevolution main- 
tained the structural characteristics of these regions and 
consequently maintained protein conformational and 
functional stability. Hence, the coevolution between 161D 
and 176D might contribute to maintaining the structural 
characteristics of caleosin. 

Critical amino acid sites of caleosin 

The predicted sites of functional divergence, positive selec- 
tion, and coevolution detected by bioinformatics software 
(DIVERGE, PAML, and CAPS, respectively) are all labeled 
in Figure 3. Moreover, the positions of a calcium-binding 
motif, a proline-knot like motif, and four phosphorylation 
sites (one tyrosine kinase and three casein kinase II phos- 
phorylation sites) are indicated at the tops of the sequences 
based on the proposed secondary structure from Chen 
et al. [11]. The two proline positions of the proline-knot 
motif were both indicated as functional divergence sites. 
Among the 34 positions behind the proline-knot motif, 32 



(94.12 %) were considered as functional divergence sites. 
Including another four sites in the proline-knot motif, 36 of 
38 (94.74 %) sites were functionally divergent, and all seven 
sites under positive selection were also located in this re- 
gion. Interestingly, the analysis showed that the critical 
amino acid sites of functional divergence and positive selec- 
tion were mainly localized in the C-terminal domain of 
caleosin, while revolutionary sites were concentrated in 
the calcium-binding region (Figure 3). 

Both the structural stability and thermostability of arti- 
ficial oil bodies have been shown to be slightly or se- 
verely reduced under the truncation of the amphipathic 
a-helix (15 residues) or proline-knot subdomain (21 res- 
idues) of recombinant caleosin, and the whole central 
hydrophobic domain of 36 (15 + 21) residues is thus cru- 
cial for the stability of oil bodies [12,52,53]. The pre- 
dicted a-helix and proline-knot are labeled in Figure 3, 
along with the two predicted functional divergence sites 
and four positive selection sites coinciding with this re- 
gion. A disulfide bridge between two cysteine residues 
(C221 and C230 of AtCLOl) reported by Purktova et al. is 
also shown in Figure 3 [12]. That previous work deter- 
mined that the two cysteine sites are the only cysteine resi- 
dues in the caleosin primary structure and that C221 is 
highly conserved in all known caleosin sequences, while 
C230 is not as strictly conserved. However, in our align- 
ment of eight caleosins from A. thaliana, we observed that 
this situation existed in only four members {AtCLOl, 
AtCL02, AtCL03, and AtCL06) of sub IV (Figure 3). The 
cysteine sites of the caleosins in sub III are scattered across 
other locations. The distinct difference at crucial amino 
acid residues between subfamilies might contribute to an- 
swer the puzzle of how the highly conserved caleosin iso- 
forms execute several diverse functions. Moreover, a wider 
physiological role for caleosins in seeds was suggested by 
the demonstration that the Arabidopsis oil-body associated 
AtCLOl protein has a calcium-dependent haem-oxygenase 
activity that is regulated by one or two conserved ferric- 
binding histidine residues 70-His and 138-His [17]. In 
Figure 3, 70-His and 138-His are highlighted by blue 
frames. The 70-His site is included in the calcium binding 
EF-hand motif and is also under coevolution with other 
sites around this motif, while the 138-His site is predicted 
to have undergone functional divergence during evolution. 
Both of these sites appear to be crucial to caleosin, but 
their exact roles require further clarification. All the pre- 
dicted critical amino acid sites should give a firm basis for 
forthcoming studies of caleosin. 

Conclusion 

This study provides a comparative genomic analysis ad- 
dressing the phylogenetic relationships and evolution of the 
caleosin gene family in 15 species representing five major 
lineages. The results of the phylogenetic analysis revealed 
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that five well-conserved subfamilies exist in plants and that 
all caleosins may have originated from a common ancestor 
of green plants. Among these subfamilies, sub II coincides 
with the distinct P-caleosin isoform recently identified in 
the pollen oil bodies of lily. Motif analysis detected one 
motif, designated as motif 6, coincident with the N-terminal 
appendix discovered by previous research, which may be re- 
sponsible for the classification of caleosins. Tandem duplica- 
tion was likely to have been the dominant mechanism of 
gene amplification during the expansion of the caleosin fam- 
ily in monocots, while the dicot caleosin family probably ex- 
panded primarily through segmental duplication. Functional 
divergence analysis suggested that significant site-specific se- 
lective constraints may have acted on many caleosin genes 
after gene duplication, leading to subfamily-specific func- 
tional evolution, and functional divergence may reflect the 
existence of long-term selective pressures. Further analysis 
showed that the majority of the detected critical amino acid 
sites for functional divergence and positive selection were 
located in the C-terminal domain of caleosin. Finally, co- 
evolutionary analysis highlighted that the majority of de- 
tected coevolving sites were located in the Ca 2+ -binding 
region. The data obtained from our investigation will con- 
tribute to an improved understanding of the complexity of 
the caleosin gene family and of its function and evolution 
in green plants. 

Methods 

Identification of caleosin genes 

Caleosin gene families were identified from 15 completely 
sequenced genomes representing the plant lineage from 
unicellular green algae to multicellular plants. The search 
was performed using "caleosin" as a keyword in the Phyto- 
zome (http://www.phytozome.org) database; eight Arabi- 
dopsis thaliana caleosin genes were first retrieved and 
then used as a query sequence in BLAST searches 
(BLASTP and TBLASTN) and the sequences retrieved 
from corresponding plant-genome annotation resources 
were analyzed. Partial and redundant sequences were ex- 
cluded. Sequences were obtained from the following 
groups and species: the unicellular green algae Chlamydo- 
monas reinhardtii and Volvox carteri (http://www.phyto- 
zome.org); the moss Physcomitrella patens (http://genome. 
jgi-psf.org/); the lycophyte Selaginella moellendorffii (http:// 
genome.jgi-psf.org/); the monocotyledonous angiosperms 
Brachypodium distachyon (http://www.brachypodium.org), 
Oryza sativa (http://rapdb.dna.affrc.go.jp/), Setaria italica 
(http://www.phytozome.org), Zea mays (http://wwwmaize- 
sequence.org), and Sorghum bicolor (http://genome.jgi-psf. 
org/); and the dicotyledonous angiosperms Arabidopsis 
thaliana (http://www.arabidopsis.org/), Citrus sinensis 
(http://www.phytozome.org), Cucumis sativus (http://gen- 
ome.jgi-psf.org/), Brassica rapa (http://brassicadb.org/ 
brad/), Glycine max (http://genome.jgi-psf.org/soybean), 



and Populus trichocarpa (http://genome.jgi-psf.org/). In 
addition, the Pfam (http://pfam.sanger.ac.uk/) and SMART 
(http://smart.embl-heidelberg.de/) databases were employed 
to detect conserved domains with caleosin or caleosin-like 
protein candidates. Finally, we manually refined the search 
results based on the Pfam (PF05042) and SMART analyses 
to further reduce hits with partially conserved functional 
domains and other false positives. 

Phylogenetic analysis 

Multiple sequence alignment was the first step in the 
phylogenetic analysis; alignment quality may have a sig- 
nificant impact on the final phylogenetic tree [54,55]. 
Amino acid sequences of caleosin genes were aligned 
using Clustal X [22,56] with the default parameters, ex- 
cluding poorly aligned positions, gap positions, and diver- 
gent regions. Phylogenetic analyses were performed with a 
neighbor-joining (N-J) method using MEGA5 [23] and the 
reliability of interior branches was assessed with 1000- 
bootstrap resampling. To confirm the tree topologies, 
maximum likelihood (ML) and Bayesian trees were con- 
structed using PhyML 3.0 [57] and MrBayes 3 [58,59], re- 
spectively, the results of which showed similar topology 
with only minor modifications at deep nodes (data not 
shown). Finally, the N-J phylogenetic tree was determined 
with sub V as the root (Figure 1A). 

Exon-intron structure and motif analysis 

Diagrams of exon-intron structure were obtained using the 
online Gene Structure Display Server (GSDS: http://gsds.cbi. 
pku.edu.cn) with coding sequence (CDS) and genomic se- 
quence [24]. The MEME program (http://meme.sdsc.edu) 
[25,60] was used to identify motifs in the candidate caleosin 
protein sequences. MEME was run locally with the following 
parameters: number of repetitions = zero or one, maximum 
number of motifs = 6, and optimum motif width constrained 
between 6 and 50 residues. 

Estimating the age of duplicated paralog gene pairs 

Pairwise alignment was performed to calculate the age of 
segmentally duplicated caleosin pairs using Clustal X [22] . 
The duplication age was estimated by the number of syn- 
onymous substitutions per synonymous site (Ks). The Ks 
values of the duplicated caleosin gene pairs were obtained 
using K-Estimator [61]. Based on 6.5 x 10" 9 synonymous 
substitutions per year (A) for rice [62], 6.1 x 10" 9 for soy- 
bean [63], 1.5 xlO" 8 for Arabidopsis [64], 9.1 x 10" 9 for 
poplar [65], and 1.4 x 10 -8 for Brassica [66]. The approxi- 
mate age (T) of duplication events was calculated for the 
caleosin gene pairs using the estimated Ks values as T = 
Ks/2X [67]. 

Fourfold synonymous third-codon transversion (D 4DTv ) 
distance was calculated to assess the genetic distance be- 
tween tandem duplicated pairs. The paralogous proteins 
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for each species were aligned pairwise using Clustal X 
[22], and corresponding codon alignments were created 
using the online program PAL2NAL (http://www.bork. 
embl.de/pal2nal/) [68]. The corresponding codons were 
extracted from these alignments and used to calculate the 
D4DTV distance between each aligning pair. D 4DTv values 
ranged from zero for recently duplicated peptides to -0.5 
for paralogs with an ancient evolutionary past. 

Functional divergence analysis 

To estimate the level of functional divergence and to pre- 
dict important amino acid residues among caleosin sub- 
families, the coefficients of Type-I and Type-II functional 
divergence (81 and 6 n ) between any two clusters were cal- 
culated using the methods suggested by Gu [45,49] as im- 
plemented in DIVERGE2 [38]. This method is based on 
maximum-likelihood procedures to estimate significant 
changes in the site-specific shift of evolutionary rate or 
amino acid properties after the emergence of two paralo- 
gous sequences. Type-I divergence designates amino acid 
configurations that are very conserved in cluster 1 but 
highly variable in cluster 2, or vice versa, implying that 
these residues have experienced altered functional con- 
straints [48]. Type-II divergence designates amino acid 
configurations that are very conserved in both genes but 
whose biochemical properties are very different, implying 
that these residues may be responsible for functional spe- 
cification [47]. Values of $i or 6 n that are significantly >0 
indicate site-specific altered selective constraints or a rad- 
ical shift in amino acid physiochemical properties after 
gene duplication and/or speciation [38,47]. A site-specific 
profile based on posterior probability (Q/c) was used to 
predict critical amino acid residues that were responsible 
for functional divergence. In this analysis, large Qk values 
indicated a high possibility that the functional constraint 
(or evolutionary rate) and/or the radical change in a site's 
amino acid properties differed between two clusters [38] . 

Estimating the pattern of nucleotide substitution and 
positive-selection sites 

The nonsynonymous to synonymous substitution ratio 
(dN/dS) for orthologous groups was computed by "codeml" 
in PAML [69]. The one-ratio model (MO) assumes one co 
(dN, nonsynonymous/dS, synonymous) ratio for all sites. In 
the discrete model (M3), the probabilities (pO, pi, and p2) 
of each site being subjected to purifying, neutral, and posi- 
tive selection, respectively, and their corresponding co ratios 
(coO, col, and co2) were inferred from the data. The Beta (|3) 
model (M7) is a null test for positive selection assuming a |3 
distribution with co between 0 and 1. Finally, the p and co 
model (M8) adds one extra class with the same ratio, col 
[70]. In this study, two pairs of site models were chosen in 
PAML to test positive selection. Analyses of real data and 
computer simulations [71,72] suggested that two pairs of 



site models were particularly effective [37]; the first pair of 
models was MO (one-ratio) and M3 (discrete); the second 
pair was M7 (|3) and M8 (|3 and co). Subsequent likelihood 
rate comparisons were performed for MO with M3 and M7 
with M8 to determine which models better fit the data. The 
difference in log-likelihood between the models was com- 
pared with a chi-square distribution with n degrees of free- 
dom, where n was the difference between the numbers of 
parameters in the two models. A significantly higher likeli- 
hood of the alterative model compared to the null model 
suggested positive selection. Finally, the Bayes Empirical 
Bayes (BEB) approach was used to calculate the posterior 
probability that each site belonged to the site class of posi- 
tive selection under each model [73]. 

Analysis of caleosin coevolution 

To identify coevolution between amino acid sites, a Coevo- 
lution Analysis using Protein Sequences (CAPS) was per- 
formed with PERL-based software [74]. CAPS provides a 
mathematically simple and computationally feasible means 
of comparing the correlated variance of evolutionary rates 
at two amino acid sites corrected by time since divergence 
of the protein sequences to which they belong. Blosum- 
corrected amino acid distance was used to identify amino 
acid covariation. The phylogenetic sequence relationships 
were used to remove phylogenetic and stochastic depend- 
encies between sites. The 3D protein structure was used to 
identify the nature of the dependency between coevolving 
amino acid sites. 
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